After tidying the data set, we conduct an exploratory analysis to
look for possible predictors of the `Attendance` outcome.
A brief summary of attendance by the `Type` variable
is provided below:
```r
theme_park |>
  group_by(Year, Type) |>
  # Rescale so that totals are reported in units of 100,000
  mutate(Attendance = Attendance / 100000) |>
  # .groups = "drop" returns an ungrouped result and silences the grouping message
  summarise(sum = sum(Attendance), .groups = "drop") |>
  arrange(Type) |>
  pivot_wider(
    names_from = Type,
    values_from = sum
  ) |>
  knitr::kable(
    digits = 3,
    caption = "Summary of Attendance for Three Types of Facilities From 2019 to 2022"
  )
```
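Table: Summary of Attendance for Three Types of Facilities From 2019 to 2022 (attendance in units of 100,000)

| Year | Amusement/Theme Park | Museum | Water Park |
|---:|---:|---:|---:|
| 2019 | 37996.4 | 20100.8 | 5898.9 |
| 2020 | 13031.1 | 4664.5 | 2313.5 |
| 2021 | 22463.7 | 6459.0 | 3473.5 |
| 2022 | 21280.8 | 11603.3 | 4678.3 |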
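From this table, some observed patterns are:

* Attendance for all three facility types drops sharply from 2019 to 2020
* All three types rebound in 2021, and museums and water parks continue to rise in 2022, while amusement/theme parks decline slightly
* No facility type returns to its 2019 level by 2022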
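To quantify the year-to-year changes in the table, the following is a minimal sketch, assuming the totals are copied by hand from the table above. The names `totals`, `pct_change`, and `yoy` are illustrative and do not come from the original analysis.

```r
# Totals copied from the summary table (attendance in units of 100,000).
totals <- data.frame(
  Year       = 2019:2022,
  amusement  = c(37996.4, 13031.1, 22463.7, 21280.8),
  museum     = c(20100.8, 4664.5, 6459.0, 11603.3),
  water_park = c(5898.9, 2313.5, 3473.5, 4678.3)
)

# Hypothetical helper: percent change relative to the previous year.
pct_change <- function(x) round(100 * (x[-1] / x[-length(x)] - 1), 1)

# Apply the helper to every attendance column (all columns except Year).
yoy <- as.data.frame(lapply(totals[-1], pct_change))
yoy$Year <- totals$Year[-1]

print(yoy)
```

A negative value marks a year in which total attendance fell relative to the year before, so the 2020 row should be negative for every facility type.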
The distribution of attendance by year is visualized in the box plots below:
```r
theme_park |>
  group_by(Year) |>
  plot_ly(y = ~Attendance, color = ~Year, type = "box", colors = "viridis") |>
  layout(
    # Place the plot title in the top-right corner of the plotting area
    annotations = list(
      x = 1, y = 1, text = "Plot 1: Distribution of Attendance by Year",
      showarrow = FALSE, xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "auto", xshift = 0, yshift = 0,
      font = list(size = 15)
    )
  )
```
Next, we look at the trend in mean `Attendance`
from 2019 to 2022 for each level of the `Region` variable.
```r
theme_park |>
  group_by(Region, Year) |>
  # Mean attendance per region and year; .groups = "drop" silences the grouping message
  summarise(attend_mean = mean(Attendance), .groups = "drop") |>
  plot_ly(
    x = ~Year, y = ~attend_mean, color = ~Region,
    # "point" is not a valid scatter mode; "lines+markers" draws the trend explicitly
    type = "scatter", mode = "lines+markers", colors = "viridis"
  ) |>
  layout(
    annotations = list(
      x = 1, y = 1, text = "Plot 2: Change in Attendance for Each Region",
      showarrow = FALSE, xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "auto", xshift = 0, yshift = 0,
      font = list(size = 15)
    )
  )
```
* Most regions show similar fluctuations in mean attendance
* EMEA is the only region whose mean attendance drops from 2021 to 2022
## ANOVA TEST
### Based on Type of Facilities
The first ANOVA test focuses on the `Type` variable in our data set. The null and alternative hypotheses are as follows:
$$H_0: \mu_{\text{Amusement/Theme Park}} = \mu_{\text{Water Park}} = \mu_{\text{Museum}} ~~ \text{vs} ~~ H_1: \text{at least two means are not equal}$$
```r
anova_1 <- aov(Attendance ~ Type, data = theme_park)
summary(anova_1)
```

```
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Type          2 1.127e+17 5.635e+16   105.3 <2e-16 ***
## Residuals   737 3.944e+17 5.351e+14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
With a p-value of less than 2e-16, we reject the null
hypothesis at any conventional significance level. We have evidence
that mean attendance differs between at least two of the facility
types in the `Type` variable.
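As a check on the output, the F value can be reproduced by hand from the reported sums of squares and degrees of freedom. This is a worked calculation using the rounded values printed in the ANOVA table, so the result agrees only to the precision shown:

$$\text{MS}_{\text{Type}} = \frac{1.127 \times 10^{17}}{2} \approx 5.635 \times 10^{16}, \qquad \text{MS}_{\text{Residuals}} = \frac{3.944 \times 10^{17}}{737} \approx 5.351 \times 10^{14}$$

$$F = \frac{\text{MS}_{\text{Type}}}{\text{MS}_{\text{Residuals}}} \approx \frac{5.635 \times 10^{16}}{5.351 \times 10^{14}} \approx 105.3$$

The degrees of freedom are consistent with three facility types (3 - 1 = 2) and 740 observations (740 - 3 = 737).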